A Comparison Study of Item Exposure Control Strategies in MCAT

Authors: Xiuzhen Mao; Burhanettin Ozdemir; Yating Wang; Tao Xin
Publication date: 2016/10
Abstract 

Four item selection indices with and without exposure control are evaluated and
compared in multidimensional computerized adaptive testing (MCAT). The four item selection
indices are D-optimality, posterior expected Kullback-Leibler information (KLP), the
minimized error variance of the linear combination score with equal weight (V1), and the
minimized error variance of the composite score with optimized weight (V2). Three exposure
control methods are adopted: the maximum priority index (MPI) method, originally proposed for
unidimensional CAT, and the restrictive threshold (RT) and restrictive progressive (RPG)
methods, originally proposed for cognitive diagnostic CAT. The results show that: (1) KLP,
D-optimality, and V1 perform well in recovering domain scores, and all outperform V2 in
psychometric precision; (2) KLP, D-optimality, V1, and V2 produce an unbalanced distribution
of item exposure rates, although V1 and V2 offer improved item pool usage rates; (3) all the
exposure control strategies greatly improve exposure uniformity with very little loss in
psychometric precision; (4) RPG and MPI perform similarly in exposure control, and both are
better than RT.

Keywords: Multidimensional Item Response Theory; Computerized Adaptive Testing; Item
Selection Methods; Exposure Control Strategy; Psychometric Precision.


Introduction 

The fact that test items are chosen sequentially and adaptively in computerized adaptive 
testing (CAT) has broken the traditional testing mode in which thousands of people respond to 
the same items at the same time. Nowadays, CAT is increasingly favored by test practitioners and
researchers for its higher efficiency, shorter test time, and lower pressure than paper-and-pencil
(P&P) testing. Another attractive characteristic of CAT is that different item response
models can be applied, including unidimensional, multidimensional, and cognitive diagnostic
models.

Multidimensional computer adaptive testing (MCAT) possesses the advantages of both 
multidimensional item response theory (MIRT) and CAT. On the one hand, a large number of 
studies based on different test conditions have arrived at the conclusion that MCAT provides 
higher efficiency than unidimensional CAT. For example, Segall (1996) employed simulated data 
based on nine adaptive power tests of the Armed Services Vocational Aptitude Battery (ASVAB) 
to show that MCAT reduced by about one-third the number of items required to generate equal or 
higher reliability with similar precision to unidimensional CAT. Luecht (1996) demonstrated that 
MCAT can reduce the number of items for tests with content constraints by 25-40%. Further,
Wang and Chen (2004) illustrated the higher efficiency of MCAT compared with unidimensional
CAT under different latent trait correlations, numbers of latent traits, and scoring levels. On the other
hand, the fact that several ability profiles are estimated simultaneously indicates the ability of 
MCAT to offer detailed diagnostic information regarding domain scores and overall scores. The 
advantages of multi-dimensionality and high efficiency make MCAT better suited to real tests 
than unidimensional CAT. Hence, many studies on MCAT have considered real item pools, such 
as TerraNova (Yao, 2010), American College Testing (ACT) (Veldkamp & van der Linden, 2002), 
and ASVAB (Segall, 1996; Yao, 2012, 2014a). 

Since Bloxom and Vale (1987) extended unidimensional CAT to MCAT, it has received 
increasing attention, and several breakthroughs have been reported in the last decade. Among the 
studies on ability estimation methods, the testing stopping rule, and item replenishing, item 
selection rules have become popular because of their important role in affecting the test quality 
and psychometric precision. Thus, most researchers focus on proposing new item selection 
indices to decrease errors in ability estimation. However, Yao (2014a) pointed out that most item
selection methods tend to select a particular type of item, leading to the problem of unbalanced
item utility. She gave the example of the Kullback-Leibler index, which prefers items that
have either high discrimination on every dimension or markedly different discrimination
across dimensions. As another example, the D-optimality index tends to select items
with high discrimination on only one dimension (Wang, Chang, & Boughton, 2011). Because
CAT is increasingly used in many kinds of tests, item exposure control is important in the
application of MCAT, especially in high-stakes tests. Nevertheless, few studies
have investigated this problem in MCAT. Hence, the goal of the present study is to evaluate the
performance of some exposure control techniques in MCAT.

To date, many of the exposure control methods used in unidimensional CAT have been 
generalized to MCAT. For example, Finkelman, Nering and Roussos (2009) extended the 
Sympson-Hetter (S-H) (Sympson & Hetter, 1985) and Stocking-Lewis (S-L) (Stocking & Lewis, 
1998) methods to MCAT. They found that all the S-H, generalized S-H, and generalized S-L 
methods do well in controlling the maximum item exposure rates. However, simulation 
experiments to create the exposure control parameters are time-consuming. Furthermore, there 
still exist some underexposed items. In addition, Yao (2014a) compared S-H with the fix-rate
procedure. The fix-rate procedure is similar to the maximum priority index (MPI) method
proposed by Cheng and Chang (2009) for unidimensional CAT. She showed that the S-H method 
performs better in terms of test precision, whereas the latter gives a higher item bank usage and 
controls the maximum item exposure rate well. 

The $|a_{j1} - a_{j2}|$-stratification method (Lee, Ip, & Fuh, 2008) is based on the principle of
the a-stratification method (Chang & Ying, 1999). The item pool is stratified according to the
absolute value of $a_{j1} - a_{j2}$, where $a_j = (a_{j1}, a_{j2})$ denotes the item discrimination vector of item
$j$. It was reported that the $|a_{j1} - a_{j2}|$-stratification method is effective in combating overused
items and increasing the item pool usage. However, this method cannot guarantee that no items
are overexposed. Thus, Huebner, Wang, Quinlan, and Seubert (2015) combined
$|a_{j1} - a_{j2}|$-stratification with the item eligibility method (van der Linden & Veldkamp, 2007)
with the aim of enhancing the balance of item exposure. This combination method improves the
exposure rates of underused items and suppresses the observed maximum item exposure rate.



However, these two methods are restricted to tests with two dimensions. Constructing a suitable 
functional of the discrimination parameter for tests with more than two dimensions remains an 
important research problem. 

It is well known that the uniformity of item exposure rates is affected by the numbers of
overexposed and underexposed items. Of the above-mentioned exposure control methods used in
MCAT, the S-H, generalized S-H, generalized S-L, fix-rate, and item eligibility methods perform
well in suppressing the maximum item exposure rates, and the $|a_{j1} - a_{j2}|$-stratification method
effectively improves the utility of underexposed items. Although the combination method used
by Huebner et al. (2015) performs well in both aspects, it is only suitable for tests with two
dimensions.

The uniformity of item exposure rates and measurement precision are the two most
important considerations when applying MCAT to practical tests, especially
high-stakes tests. Because the two trade off against each other, practitioners hope to find
an item selection method that not only guarantees test precision, but also decreases the
maximum item exposure rate while increasing the exposure rate of underexposed items.

However, there are no methods that can effectively balance item exposure rates for tests with
more than two dimensions. In addition, two exposure control methods have not been studied for
MCAT: the restrictive threshold (RT) method and the restrictive progressive (RPG) method. It
has been reported that they perform well in balancing the item exposure rates of cognitive
diagnostic CAT (Wang, Chang, & Huebner, 2011). Therefore, the focus of the present study is
whether RT and RPG can simultaneously suppress the maximum item exposure rates and
increase the exposure rates of underexposed items without losing psychometric precision in
MCAT. Further, their performance is compared with that of the MPI method.

In the remainder of this paper, we first introduce the MIRT model employed in this
study and the ability estimation method. Then, some item selection indices and exposure control
strategies are described. The performance of four item selection indices with and without each of
the three exposure control strategies under different latent trait correlation levels is then examined
through a series of simulation experiments. The results, conclusions, and discussion are given in
the final two sections of the paper.

MIRT model and ability estimation method 
Multidimensional Two-Parameter Logistic (M-2PL) Model 

MIRT models are usually classified as compensatory or non-compensatory based on 
whether a strong ability can compensate for other weak profiles. Bolt and Lall (2003) reported 
that both types are able to fit the data generated by non-compensatory models, but 
non-compensatory models cannot match the data generated from compensatory models. Thus, 
because of the advantages of compensatory models and the wide usage of MCAT in dealing with 
dichotomous items (van der Linden, 1999; Veldkamp & van der Linden, 2002; Mulder & van der 
Linden, 2010), the M-2PL model was adopted to simulate item parameters and generate item 
responses. 

For item $j$, M-2PL includes a scalar difficulty parameter $b_j$ and a discrimination
vector $a_j = (a_{j1}, a_{j2}, \ldots, a_{jD})^T$ (McKinley & Reckase, 1982), where $T$ denotes the transpose and
$D$ is the number of dimensions. For an examinee with ability $\theta = (\theta_1, \theta_2, \ldots, \theta_D)^T$, the item
response function can then be described as:

$$P_j(\theta) = P(x_j = 1 \mid \theta, a_j, b_j) = \frac{1}{1 + \exp[-(a_j^T \theta - b_j)]}, \tag{1}$$

where $a_j^T \theta - b_j = \sum_{l=1}^{D} a_{jl}\theta_l - b_j$ denotes a straight line in $D$-dimensional space. The
compensatory features of M-2PL originate from the fact that all examinees giving equal $a_j^T \theta$
possess the same response probability.
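The compensatory property can be illustrated numerically. The following is a minimal Python sketch, not the authors' Matlab code; the function name `m2pl_probability` and the item parameters are hypothetical values chosen only for illustration.

```python
import math

def m2pl_probability(theta, a, b):
    """Probability of a correct response under M-2PL, as in equation (1)."""
    # Linear predictor a^T theta - b
    z = sum(a_l * theta_l for a_l, theta_l in zip(a, theta)) - b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical three-dimensional item (illustrative values only)
a = [1.0, 0.5, 2.0]
b = 0.3

# Two different ability profiles that share the same weighted sum a^T theta = 1.0
p_first = m2pl_probability([1.0, 0.0, 0.0], a, b)
p_second = m2pl_probability([0.0, 2.0, 0.0], a, b)
print(p_first, p_second)
```

Both profiles yield the same probability because a high ability on the second dimension compensates for a low ability on the first.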

Ability Estimation Method: Maximum a Posteriori (MAP) Estimation 

Yao (2014b) compared MAP, expected a posteriori (EAP), and maximum likelihood 
estimation (MLE) in a simulation experiment using item parameters estimated from the ASVAB 
Armed Forces Qualification Test. She pointed out that: (a) MLE generates smaller
bias and larger root mean square error (RMSE), whereas MAP and EAP using strong prior
information or standard normal priors produce higher precision in the recovery of ability, and (b)
EAP and MAP behave similarly, but EAP takes longer than MAP. Recently, Huebner et al.
(2015) compared EAP with MLE in MCAT and showed that EAP always produces more stable
results and lower mean square error in the ability estimators than MLE. MAP is adopted in this
study for its competitive precision and easier computation compared with EAP in MIRT.

Let $f(\theta)$ denote the prior density function of $\theta$. This is assumed to be a multivariate
normal distribution with mean $\mu_0$ and variance-covariance matrix $\Sigma_0$. For convenience,
the response to item $j$ is denoted $x_j$, and $X_{k-1}$ represents the response vector of the first
$k - 1$ items administered. The posterior density function of $\theta$ is denoted by $f(\theta \mid X_{k-1})$.

Based on Bayes' theorem, $f(\theta \mid X_{k-1}) \propto L(X_{k-1} \mid \theta) \cdot f(\theta)$, where $L(X_{k-1} \mid \theta)$ denotes the
likelihood function. Hence, the goal of MAP is to find the mode that maximizes the posterior
density function $f(\theta \mid X_{k-1})$. That is, the ability estimator $\hat{\theta}^{MAP}$ is the solution of

$$\frac{\partial \log f(\theta \mid X_{k-1})}{\partial \theta_l} = 0 \quad (l = 1, 2, \ldots, D).$$

Newton-Raphson iteration can be used to
solve this equation; for details, see Yao (2014b).
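As a rough illustration of the Newton-Raphson step, the sketch below computes a MAP estimate for M-2PL in Python. It is not the procedure of Yao (2014b): it assumes a standard multivariate normal prior ($\mu_0 = 0$, $\Sigma_0 = I$), and all function names and the small item set are hypothetical.

```python
import math

def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [A[i][:] + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def log_posterior_derivatives(theta, items, responses):
    """Gradient and Hessian of the log posterior (assumed prior: N(0, I))."""
    D = len(theta)
    grad = [-t for t in theta]  # prior contribution: -(theta - 0)
    hess = [[-1.0 if r == c else 0.0 for c in range(D)] for r in range(D)]
    for (a, b), x in zip(items, responses):
        z = sum(a_l * t_l for a_l, t_l in zip(a, theta)) - b
        p = 1.0 / (1.0 + math.exp(-z))
        for r in range(D):
            grad[r] += a[r] * (x - p)
            for c in range(D):
                hess[r][c] -= p * (1.0 - p) * a[r] * a[c]
    return grad, hess

def map_estimate(items, responses, n_dims, max_iter=50, tol=1e-10):
    """Newton-Raphson iteration: theta <- theta - H^{-1} g, starting from 0."""
    theta = [0.0] * n_dims
    for _ in range(max_iter):
        grad, hess = log_posterior_derivatives(theta, items, responses)
        step = solve_linear(hess, grad)
        theta = [t - s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return theta

# Hypothetical two-dimensional items (a, b) and responses
items = [([1.0, 0.2], 0.0), ([0.3, 1.2], -0.5), ([0.8, 0.8], 0.5)]
responses = [1, 0, 1]
theta_map = map_estimate(items, responses, n_dims=2)
print(theta_map)
```

At the returned estimate, the gradient of the log posterior should be numerically zero, which is the condition stated above.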

Item Selection Indices and Exposure Control Strategies 

To simplify the description, we first introduce some notation. $N$ represents the number
of examinees, and $L$ is the test length. Set $R$ refers to the item bank, which has a capacity of
$M$. Set $R_{k-1} = R \setminus \{i_1, i_2, \ldots, i_{k-1}\}$ and $\hat{\theta}^{k-1}$ denote the remainder of the item bank and the
temporary ability estimator after administering the first $k - 1$ items, respectively.

Item Selection Indices 

The following four indices are chosen as item selection criteria based on the 
consideration of computation complexity and running time. 

D-optimality. The Fisher information of each item in MIRT is no longer a number, but a
matrix. Specifically, the Fisher information for the $j$th item in M-2PL is

$$I_j(\theta) = P_j(\theta)\,(1 - P_j(\theta))\,a_j a_j^T, \tag{2}$$

where, with $a_j$ a column vector, $a_j a_j^T$ is a $D \times D$ matrix.

After $k - 1$ items have been administered, the estimators form an ellipse or sphere $V_{k-1}$.
To decrease the size or volume of $V_{k-1}$ as quickly as possible, Segall (1996) proposed that the
$k$th item should maximize the determinant of the posterior test Fisher information matrix. Thus,
the Bayesian item selection rule is expressed as

$$D_k = \max\{\, |I_{k-1}(\hat{\theta}^{k-1}) + I_j(\hat{\theta}^{k-1}) + \Sigma_0^{-1}|,\ j \in R_{k-1} \,\}, \tag{3}$$

where $I_{k-1}(\hat{\theta}^{k-1})$ represents the test information of the first $k - 1$ items already administered,
calculated at the current ability estimate, and $I_j(\hat{\theta}^{k-1})$ indicates the Fisher information of the
$j$th ($j \in R_{k-1}$) candidate item. This method was called D-optimality by Mulder and van der
Linden (2009), and the item with the largest $D_k$ is chosen from the remainder pool.
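The selection rule in (3) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the helper names, the two-item pool, and the identity prior precision matrix are assumptions made for the example.

```python
import math

def determinant(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] for row in A]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det  # a row swap flips the sign
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= factor * M[col][c]
    return det

def item_information(theta, a, b):
    """Item Fisher information matrix P(1 - P) a a^T, as in equation (2)."""
    z = sum(a_l * t_l for a_l, t_l in zip(a, theta)) - b
    p = 1.0 / (1.0 + math.exp(-z))
    D = len(a)
    return [[p * (1.0 - p) * a[r] * a[c] for c in range(D)] for r in range(D)]

def d_optimality_select(theta_hat, test_info, pool, prior_precision):
    """Return (item id, determinant) maximizing |I_{k-1} + I_j + Sigma_0^{-1}|."""
    D = len(theta_hat)
    best_id, best_det = None, float("-inf")
    for item_id, (a, b) in pool.items():
        info = item_information(theta_hat, a, b)
        total = [[test_info[r][c] + info[r][c] + prior_precision[r][c]
                  for c in range(D)] for r in range(D)]
        d = determinant(total)
        if d > best_det:
            best_id, best_det = item_id, d
    return best_id, best_det

# Hypothetical start of a two-dimensional test: no items administered yet
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
pool = {1: ([2.0, 0.0], 0.0), 2: ([0.5, 0.5], 0.0)}
best_id, best_det = d_optimality_select([0.0, 0.0], zeros, pool, identity)
print(best_id, best_det)
```

In this toy pool the first item, which discriminates strongly on a single dimension, produces the larger determinant.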

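Posterior Expected Kullback-Leibler Information (KLP). This method is obtained by
weighting the KL information according to the posterior distribution of ability. That is, the $k$th
item is selected according to

$$KLP_k = \max\Big\{ \int KL_j(\hat{\theta}^{k-1}, \theta)\, f(\theta \mid X_{k-1})\, d\theta,\ j \in R_{k-1} \Big\}, \tag{4}$$

where

$$KL_j(\hat{\theta}^{k-1}, \theta) = E_{\theta} \log\left[\frac{P_j(x_j \mid \theta, a_j, b_j)}{P_j(x_j \mid \hat{\theta}^{k-1}, a_j, b_j)}\right] = P_j(\theta) \log\frac{P_j(\theta)}{P_j(\hat{\theta}^{k-1})} + (1 - P_j(\theta)) \log\frac{1 - P_j(\theta)}{1 - P_j(\hat{\theta}^{k-1})}. \tag{5}$$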


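The integration region is generally narrowed to simplify the computation, and (4) is replaced with

$$KLP_k = \max\Big\{ \int_{\hat{\theta}_1^{k-1} - \gamma_j}^{\hat{\theta}_1^{k-1} + \gamma_j} \cdots \int_{\hat{\theta}_D^{k-1} - \gamma_j}^{\hat{\theta}_D^{k-1} + \gamma_j} KL_j(\hat{\theta}^{k-1}, \theta)\, f(\theta \mid X_{k-1})\, d\theta_1 \cdots d\theta_D,\ j \in R_{k-1} \Big\}, \tag{6}$$

where $\gamma_j$ usually takes a value of $3/\sqrt{j}$.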

Minimum Error Variance of the Linear Combination Score with Equal Weight
(V1). From the perspective of error variance, van der Linden (1999) suggested that the $k$th item
should minimize the error variance of the composite score $\theta_\alpha = \sum_{l=1}^{D} \theta_l \cdot w_l$. Let $SEM(\theta_\alpha)$
denote the standard error of measurement (SEM) for the composite score $\theta_\alpha$. Yao (2012) derived
the formula $SEM(\theta_\alpha) = (V(\theta_\alpha))^{1/2} = (w V(\theta) w^T)^{1/2}$, where $V(\theta)$ is usually approximated by
the inverse of the test information matrix. Given equal weights $w = (1/D, 1/D, \ldots, 1/D)$ among the different dimensions, the
item that minimizes $SEM(\theta_\alpha)$ will be selected by V1.


Minimum Error Variance of the Linear Combination Score with Optimized Weight
(V2). The weight that minimizes the SEM of the composite ability is named the optimal weight.
Yao (2012) proved the existence of the optimized weight, and derived its formula as

$$w = \frac{[1, 1, \ldots, 1] \cdot I_{k-1}(\theta)}{\sum_{o=1}^{D} \sum_{l=1}^{D} b_{ol}}. \tag{7}$$

In this expression, $b_{ol}$ denotes the element of $I_{k-1}(\theta)$ located on the $o$th row and $l$th
column. The procedure of V2 involves finding the optimal weight vector, then calculating SEM
for each candidate item according to the optimal weight. Finally, the item with the lowest SEM is
selected from the remainder pool. Note that the optimal weight is updated after administering
each item. Thus, the only difference between V2 and V1 is in the determination of the weight
used to compute $SEM(\theta_\alpha)$.
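To make the difference between the equal and optimized weights concrete, the following Python sketch evaluates the composite error variance $w V w^T$ for both. It assumes, as an illustration only, a diagonal two-dimensional information matrix whose inverse is used for $V$; the function names are hypothetical.

```python
def optimal_weights(info):
    """Weights from equation (7): column sums of the information matrix
    divided by the sum of all its elements."""
    D = len(info)
    total = sum(sum(row) for row in info)
    return [sum(info[o][l] for o in range(D)) / total for l in range(D)]

def composite_variance(w, V):
    """Error variance of the composite score, w V w^T."""
    D = len(w)
    return sum(w[r] * V[r][c] * w[c] for r in range(D) for c in range(D))

# Hypothetical information matrix and its inverse (assumed approximation of V)
info = [[2.0, 0.0], [0.0, 1.0]]
V = [[0.5, 0.0], [0.0, 1.0]]

w_opt = optimal_weights(info)
var_opt = composite_variance(w_opt, V)
var_equal = composite_variance([0.5, 0.5], V)
print(w_opt, var_opt, var_equal)
```

In this toy case the optimized weights place more weight on the better-measured dimension and give a smaller composite variance than equal weights.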

Strategies of Item Exposure Control 

The RT and RPG methods proposed by Wang, et al. (2011) are two exposure control 
methods used in cognitive diagnostic CAT. Both can be easily generalized to MCAT. 

The RT method. In the RT method, a shadow item bank is constructed at the beginning
of each test by removing all overexposed items from the original item bank. Each item is then
selected at random from the candidate item set constructed beforehand. Let “Index” denote the
value of the item selection indices. The candidate item set includes all items whose information
values lie in $[\max(Index) - \delta, \max(Index)]$ for both D-optimality and KLP or
$[\min(Index), \min(Index) + \delta]$ for V1 and V2. The constant $\delta$ is defined as
$\delta = [\max(Index) - \min(Index)] \cdot (1 - k/L)^{\beta}$. Larger values of $\beta$ give a shorter information
interval. As a result, the measurement precision is improved at the cost of decreased uniformity
of the item exposure distribution. In summary, $\beta$ is used to balance the requirements of item
exposure rate control and measurement precision. In this study, we use $\beta = 0.5$.
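The construction of the RT candidate set can be sketched as follows in Python. The function names and the three index values are hypothetical, and the input is assumed to contain only items that remain in the shadow bank (that is, overexposed items have already been removed).

```python
import random

def rt_candidates(index_values, k, L, beta=0.5, maximize=True):
    """Candidate set of the RT method for the k-th item of an L-item test.

    index_values maps item id -> index value for items in the shadow bank.
    maximize=True corresponds to D-optimality/KLP, False to V1/V2.
    """
    hi = max(index_values.values())
    lo = min(index_values.values())
    delta = (hi - lo) * (1.0 - k / L) ** beta
    if maximize:
        return [j for j, v in index_values.items() if v >= hi - delta]
    return [j for j, v in index_values.items() if v <= lo + delta]

def rt_select(index_values, k, L, beta=0.5, maximize=True, rng=random):
    """Pick one item at random from the candidate set."""
    return rng.choice(rt_candidates(index_values, k, L, beta, maximize))

# Hypothetical index values for three eligible items
values = {1: 10.0, 2: 6.0, 3: 2.0}
early = rt_candidates(values, k=1, L=30)   # wide interval early in the test
late = rt_candidates(values, k=30, L=30)   # interval shrinks to the best item
print(early, late)
```

The interval narrows as the test proceeds, so selection is close to random at the start and close to purely index-driven at the end.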

The RPG Method. The $k$th ($k = 1, 2, \ldots, L$) item is selected according to (8) for
D-optimality and KLP, and according to (9) for V1 and V2. They are

$$i_k = \arg\max\{ (1 - er_j / r^{max}) \cdot [(1 - k/L)\, u_j + Index_j \times \beta k / L],\ j \in S_{k-1} \} \tag{8}$$

and

$$i_k = \arg\max\{ (1 - er_j / r^{max}) \cdot [(1 - k/L)\, u_j + (C - Index_j) \times \beta k / L],\ j \in S_{k-1} \}, \tag{9}$$

where $er_j$ denotes the observed exposure rate of item $j$ and $r^{max}$ denotes the allowed
maximum exposure rate. Let $H^*$ be the maximum item information in $S_{k-1}$. Then, $u_j$ is
drawn uniformly from the interval $(0, H^*)$. The parameter $\beta$ plays the same role and takes the
same value as in the RT method. The constant $C$ should be greater than all the SEMs; in this
study, we set $C = 10000$. Note that SEM is always very large for the first several items, and
decreases rapidly to less than 1000. Thus, it is better to set $C$ to be greater than 1000.
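A minimal Python sketch of the RPG rule in (8) and (9) is given below. It is an illustration under stated assumptions: the function name is hypothetical, $r^{max} = 0.2$ is taken as the default because 0.2 is the overexposure threshold used later in the study, and for V1/V2 the maximum "information" $H^*$ is assumed to be the largest value of $C - Index_j$.

```python
import random

def rpg_select(index_values, exposure_rates, k, L, r_max=0.2, beta=0.5,
               minimize=False, C=10000.0, rng=random):
    """Select the k-th item by the restrictive progressive rule.

    minimize=False follows equation (8); minimize=True follows equation (9).
    """
    # Turn SEM-type indices into "larger is better" values via C - Index
    info = {j: (C - v) if minimize else v for j, v in index_values.items()}
    h_star = max(info.values())  # assumed definition of H*
    best_item, best_score = None, float("-inf")
    for j, value in info.items():
        u = rng.uniform(0.0, h_star)  # random component
        weight = 1.0 - exposure_rates[j] / r_max
        score = weight * ((1.0 - k / L) * u + value * beta * k / L)
        if score > best_score:
            best_item, best_score = j, score
    return best_item

# Hypothetical last item of a 30-item test: the random component vanishes
index_values = {1: 5.0, 2: 4.0}
chosen_fresh = rpg_select(index_values, {1: 0.0, 2: 0.0}, k=30, L=30)
chosen_capped = rpg_select(index_values, {1: 0.2, 2: 0.0}, k=30, L=30)
print(chosen_fresh, chosen_capped)
```

When item 1 has already reached the maximum exposure rate its weight becomes zero, so item 2 is chosen despite its lower index value.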

The maximum priority index method (MPI). According to Cheng and Chang (2009),
the priority index (PI) of item $j$ with the requirement of the maximum exposure rate is
expressed as

$$PI_j = \frac{r^{max} - n_j / N}{r^{max}} \cdot Index_j, \tag{10}$$

where $n_j$ represents the administration frequency of item $j$, and “Index” refers to the
D-optimality or KLP index. The task of the MPI method is to identify the item with the
largest PI. For V1 and V2, $PI_j$ should be changed accordingly, that is

$$PI_j = \frac{r^{max} - n_j / N}{r^{max}} \cdot (C - Index_j), \tag{11}$$

where the role of $C$ is similar to that in RPG.
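The priority index can be computed directly from the administration counts. The Python sketch below is illustrative only; the function names are hypothetical, and $r^{max} = 0.2$ is assumed as the default because 0.2 is the overexposure threshold used later in the study.

```python
def priority_index(index_value, n_admin, N, r_max=0.2, minimize=False, C=10000.0):
    """Priority index of one item: equation (10), or (11) when minimize=True."""
    base = (C - index_value) if minimize else index_value
    return (r_max - n_admin / N) / r_max * base

def mpi_select(index_values, counts, N, r_max=0.2, minimize=False, C=10000.0):
    """Return the item id with the largest priority index."""
    return max(
        index_values,
        key=lambda j: priority_index(index_values[j], counts[j], N,
                                     r_max, minimize, C),
    )

# Hypothetical case: item 1 has the best index but is close to the exposure cap
index_values = {1: 5.0, 2: 4.0}
counts = {1: 900, 2: 0}  # administrations so far among N = 5000 examinees
pi_item_1 = priority_index(index_values[1], counts[1], 5000)
chosen = mpi_select(index_values, counts, 5000)
print(pi_item_1, chosen)
```

The heavily used item is down-weighted to a priority of about 0.5, so the less exposed item is selected instead.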

Method 

A simulation study was conducted to evaluate and compare the effectiveness of the above
exposure control methods. Matlab (version 7.10.0.499) was used to write the MCAT code and run
the simulation experiments.

Design of Simulation Study 

Item Bank Construction. Although Stocking (1994) suggested that the pool should contain 
at least 12 times as many items as the test length, many simulation studies on MCAT have used a 
more restrictive item pool. For example, the item pool used by van der Linden (1999) contained 
500 items while the test length was 50; Lee, et al. (2008) used an item pool of 480 items with test 
lengths of 30 and 60; and the item pools described in Veldkamp and van der Linden (2002) and 
Mulder and van der Linden (2009) contained fewer than 200 items while the test length was 
greater than 30. Thus, it is reasonable to construct an item pool of 450 items for a test length of 
30. 


To simplify the experimental conditions, most simulation studies generate item
parameters and item responses according to M-2PL or M-3PL with the assumption that there are
two or three dimensions (van der Linden, 1999; Veldkamp & van der Linden, 2002; Lee et al.,
2008; Mulder & van der Linden, 2009; Finkelman et al., 2009; Wang, Chang, & Boughton, 2013;
Wang & Chang, 2011). Hence, without loss of generality, the items in our simulation contained
three dimensions, and the item parameters of the M-2PL model were generated in a similar way
to those of Yao and Richard (2006) and Wang and Chang (2011). Specifically, $(a_{j1}, a_{j2}, a_{j3})$ for
item $j$ ($j = 1, 2, \ldots, 450$) were drawn from $\log N(0, 0.5)$ independently and $b_j$ ($j = 1, 2, \ldots, 450$)
were drawn from $N(0, 1)$.

Examinees and Item Responses. All 5000 examinees were simulated from a
multivariate normal distribution, as in previous research (Wang & Chang, 2011; Yao, Pommerich,
& Segall, 2014; Wang et al., 2013). Three levels of correlation were considered in the
experiments. The mean ability was [0, 0, 0] and the variance-covariance matrix was

$$\begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix} \quad (\rho = 0.3, 0.6, 0.8).$$


Let $P_{ij}$ and $x_{ij}$ denote the correct response probability and actual response (0 or 1)
corresponding to the $j$th ($j = 1, 2, \ldots, 450$) item and the $i$th ($i = 1, 2, \ldots, 5000$) examinee. $P_{ij}$ was
computed from the M-2PL model, and $u_{ij}$ was drawn uniformly from (0, 1). We set $x_{ij} = 1$ if
$P_{ij} \geq u_{ij}$. Otherwise, if $P_{ij} < u_{ij}$, $x_{ij} = 0$.
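The response-generation rule can be sketched in Python as follows. This is not the authors' Matlab code; the function name, the seed, and the item parameters are hypothetical and chosen so that the response probability equals 0.5.

```python
import math
import random

def simulate_response(theta, a, b, rng):
    """Draw a dichotomous response: 1 if P >= u, else 0, with u ~ U(0, 1)."""
    z = sum(a_l * t_l for a_l, t_l in zip(a, theta)) - b
    p = 1.0 / (1.0 + math.exp(-z))  # M-2PL response probability
    u = rng.random()
    return 1 if p >= u else 0

# Hypothetical examinee and item for which P = 0.5
rng = random.Random(2016)  # fixed seed so the sketch is reproducible
theta = [0.0, 0.0, 0.0]
a = [1.0, 1.0, 1.0]
b = 0.0
responses = [simulate_response(theta, a, b, rng) for _ in range(10000)]
proportion_correct = sum(responses) / len(responses)
print(proportion_correct)
```

Over many draws the proportion of correct responses should be close to the model probability.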

Item Selection Methods. Four item selection indices with and without the three exposure
control methods yield a total of 16 item selection strategies.


Estimation of Ability. The initial abilities were drawn from the standard multivariate
normal distribution. MAP was used to update the domain abilities during the test, and
the standard multivariate normal distribution was used as the prior distribution.

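Evaluation Criteria. The bias and mean square error (MSE) of each dimension were
used to evaluate the precision of the ability estimators. They were computed as

$$Bias_l = \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_{il} - \theta_{il}) \quad (l = 1, 2, 3), \tag{12}$$

and

$$MSE_l = \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_{il} - \theta_{il})^2 \quad (l = 1, 2, 3). \tag{13}$$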
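A short Python sketch of the two precision criteria is given below, assuming the usual per-dimension averages over examinees. The function name and the two-examinee data are hypothetical.

```python
def bias_and_mse(true_thetas, estimates):
    """Per-dimension bias and MSE across examinees.

    true_thetas and estimates are lists of equal-length ability vectors,
    one vector per examinee.
    """
    N = len(true_thetas)
    D = len(true_thetas[0])
    bias = [sum(est[l] - tru[l] for tru, est in zip(true_thetas, estimates)) / N
            for l in range(D)]
    mse = [sum((est[l] - tru[l]) ** 2 for tru, est in zip(true_thetas, estimates)) / N
           for l in range(D)]
    return bias, mse

# Hypothetical two examinees measured on two dimensions
true_thetas = [[0.0, 0.0], [1.0, 1.0]]
estimates = [[0.5, 0.0], [1.5, 0.0]]
bias, mse = bias_and_mse(true_thetas, estimates)
print(bias, mse)
```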

To assess the equalization of exposure rates, we used (a) the number of items never
reached and the number of items with exposure rates greater than 0.2, (b) the $\chi^2$ statistic, and
(c) the test overlap rate. The $\chi^2$ statistic was calculated as

$$\chi^2 = \sum_{j=1}^{M} \frac{(er_j - L/M)^2}{L/M}. \tag{14}$$

Smaller values of $\chi^2$ indicate smaller differences between the observed and expected item
exposure rates. Finally, the test overlap rate was computed according to the expression proposed
by Chen, Ankenmann, and Spray (2003):

$$T = \frac{M}{L} S_{er}^2 + \frac{L}{M}. \tag{15}$$

In (15), $S_{er}^2$ denotes the variance of item exposure rates. Generally, smaller values of $T$
demonstrate more balanced item utility.


Results 


Results of Ability Estimation 


The differences in bias among the dimensions were so small for each method
that Figure 1 presents only the mean bias over the three dimensions. Figure 2 shows the MSEs of each
dimension for the different item selection methods and correlation levels.

The following results can be summarized: (a) the biases generated by D-optimality,
V1, and V2 are similar and greater than the bias produced by KLP, and (b) for each dimension,
KLP produces the smallest MSE, followed by D-optimality, V1, and V2. Overall, the indices can be
ranked in descending order of measurement precision as KLP, D-optimality, V1, and V2.

The effects of item exposure control methods on the psychometric precision were
examined from three aspects. First, from Figure 1, the item exposure strategies have no
significant effect on the bias, as the biases produced by the same item selection index using
different exposure control methods are similar.

Second, the results of each item selection index with and without item exposure control
can be compared. From Figure 2, all the item exposure strategies led to an increase in MSE
except for V2: the MSE of V2 was larger than that of V2-RT in most of the cases. The lower
measurement precision of V2 may result from its tendency to improve the item pool
utility. Overall, apart from this exception, using an exposure control strategy decreases the
measurement precision.

Furthermore, when the item exposure control methods were combined with D-optimality,
KLP, or V2, their performance differed considerably in terms of the measurement precision.
However, all the item exposure control methods yielded similar measurement precision when
combined with V1. In addition, a higher level of ability correlation seems to narrow the gap in
the precision generated by different exposure control methods when combined with the same
item selection index.

Finally, we can compare the results of different item exposure control methods. RT 
always produced the lowest MSE values, thus giving higher measurement precision than RPG 
and MPI. RPG and MPI performed similarly, although their precision under different item 
selection indices varied to some degree. The performance of RT and RPG was in accordance 
with that reported by Wang et al. (2011). Overall, the general order of different exposure control 
methods sorted by decreasing measurement precision was RT, RPG, and MPI. 

Results of Item Exposure Rates

The item exposure rates associated with each item
selection index with and without exposure rate control are presented in Table 1 and Figures 3-4.

First, the exposure rates are distributed unevenly for D-optimality,
KLP, V1, and V2. D-optimality and KLP, for example, generate the lowest item
bank usage rates and the largest overexposed item and test overlap rates. Although the number of
never-reached items in V1 and V2 is close to 0, and the test overlap rates and $\chi^2$ values are
smaller than those of D-optimality and KLP, these two indices still produce an
unsatisfactory item exposure rate distribution. These characteristics can be clearly observed in
Figure 4(a), where the exposure rates are depicted in ascending order for each of the four item
selection indices. In addition, the results for V1 and V2 obtained from this study coincide with
those reported by Yao (2014a).

Second, all the exposure control methods improved the uniformity of exposure rates
significantly in terms of increasing item bank usage and lowering overexposed item rates, test
overlap rates, and $\chi^2$. According to Table 1, RPG outperformed the other methods in most cases,
although MPI performed similarly. From Table 1, it is apparent that all the item exposure
distributions follow the same pattern when different item selection indices are combined with the
same exposure control method. Hence, Figure 4(b) only illustrates the exposure rate distributions
of the exposure control strategies combined with KLP.

In addition, different characteristics of the item exposure rate distribution were observed
for different item exposure control methods. From Figure 3, it can be seen that the item pool
usage rate reaches 100% for all methods except KLP-MPI. In other words, all item exposure control
methods significantly improve the item pool usage. As for overexposed items, both RPG
and MPI produced more overexposed items than RT under most test conditions. Generally, RT is
able to keep the item exposure rates below the allowable maximum value, whereas
both RPG and MPI result in some items with exposure rates greater than 0.2.

Further, some specific findings about particular exposure control methods are worth
pointing out. First, compared to D-MPI, V1-MPI, and V2-MPI, KLP-MPI
generated a more unbalanced item exposure rate distribution. Second, when RPG was used with
V1 or V2, there were always one or two items exposed to every examinee. Checking the internal
results of V1-RPG and V2-RPG revealed that many error variance values in Matlab were labeled
“NaN” when choosing the first or second item. In other words, it can be inferred that the
overexposed items in V1-RPG and V2-RPG were mainly due to the non-distinctive item
information matrix in V1 and V2. Furthermore, the test overlap rate and $\chi^2$ of V1-RPG and
V2-RPG were affected by the first one or two administered items accordingly.

Overall, although the item exposure control strategies produced different patterns of item
exposure rates, they all considerably improve the balance of the item exposure distribution. This
can be seen by comparing Figures 4(a) and 4(b). In addition, the trade-off between the
measurement precision and the item exposure distribution is also displayed in the results.

Conclusions and Discussions 

Many studies have acknowledged the advantages of CAT over P&P tests and 
computer-based tests, such as the decrease in test length, increase in measurement precision, and 
better model fits. Along with the obvious advantages of MCAT, choosing the most appropriate 
item selection rule is a vital step for a successful application (Wang & Chang, 2011). Although 
the proposed item selection methods yield good results in precision, they are vulnerable to the 
issue of dealing with overexposed items (those that are used too often) and underexposed items 
(used too rarely). As a solution to this problem, different item exposure control methods have 
been adopted and used together with different item selection methods. 

This study has examined the performance of four item selection indices combined with 
different exposure control methods in MCAT. 

Simulations showed that V2 outperforms D-optimality, KLP, and V1 with respect to
higher item bank usage rates, fewer overexposed items, and lower test overlap rates. Generally,
the results of all item selection indices without item exposure control were unsatisfactory
with respect to item exposure statistics. The results indicate that, without item exposure
control, the item selection indices can be sorted in descending order of psychometric precision as KLP,
D-optimality, V1, and V2. In addition, when using item exposure control methods, the
measurement precision tended to decrease for all item selection indices.

In comparing the item exposure rate distribution generated by different item exposure 
control methods, RPG outperformed the other methods in most cases, although MPI performed 
similarly. The RT method gave the worst performance. Furthermore, each item exposure control 
method yields the same exposure rate pattern under different item selection indices. When it 
comes to comparing the measurement precision, the performance of the different exposure 
control methods can be ordered as RT, RPG, and MPI. This kind of trade-off between 
measurement precision, utility of item pool, and evenness of item exposure rate has been 
observed in many studies (Chang & Twu, 1998). In other words, some measurement precision
needs to be sacrificed to keep the exposure rate at the desired value.

Both the present study and the work of Wang et al. (2011) showed that the measurement 
precision of the RT method was higher than that of the RPG method under the same test 
conditions, and the RT method performed slightly worse than RPG in the evenness of the item 
exposure distribution. In conclusion, among the three exposure control methods examined in this 
study, both RT and RPG offer balanced precision and item exposure control, whereas MPI 
performed well in controlling the item exposure rate with a noticeable loss in precision. 

Several issues regarding item selection methods for MCAT deserve further investigation.
First, although D-optimality, V1, and V2 are much faster than KLP, the run-time usually
increases with the number of test dimensions. As a consequence, time-consuming methods can
hinder the practice of MCAT in dealing with complex test conditions. In fact, the benefits of
MCAT over unidimensional CAT mainly lie in the detailed cognitive information obtained based
on multiple dimensions. Hence, there is a need for more work on algorithms that reduce the
computation time of the item selection methods, or simplified and valid item selection methods
based on existing rules, such as the two simplified KL indexes provided by Wang et al. (2011).

Second, the test measurement precision of each dimension can be guaranteed by most 
MCAT item selection methods automatically, but thousands of other constraints are encountered 
in real tests. Hence, it would be useful to research how to deal with nonstatistical constraints in 
MCAT. 

Third, polytomous items such as open-ended response items and constructed-response items have
now begun to appear in CAT (Bejar, 1991). There is no doubt that research on polytomous items
will increase in popularity. However, most current research on MCAT deals with dichotomous
items. Thus, it is important for researchers to propose new item selection methods, or to extend methods
for dichotomous items, such as the mutual information index, KL, and Shannon entropy, to deal
with polytomous items.

Table 1. Item exposure statistics of each method

Methods    Overlap rate         χ²                   Methods    Overlap rate         χ²
D          0.408/0.23/0.23      152.6/75.14/75.14    V1         0.253/0.241/0.237    83.5/78.78/76.29
D-RPG      0.067/0.065/0.068    3.78/2.53/3.97       V1-RPG     0.124/0.124/0.124    25.90/25.95/25.83
D-RT       0.123/0.122/0.123    25.63/24.89/24.86    V1-RT      0.099/0.101/0.098    14.76/14.72/14.84
D-MPI      0.075/0.073/0.069    0.97/0.974/0.96      V1-MPI     0.072/0.073/0.072    2.52/2.59/2.55
KLP        0.145/0.238/0.325    42.02/78.54/96.15    V2         0.114/0.113/0.113    21.37/20.83/20.81
KLP-RPG    0.078/0.074/0.074    7.23/3.40/3.45       V2-RPG     0.124/0.125/0.124    15.89/25.92/15.90
KLP-RT     0.121/0.119/0.118    24.45/23.47/23.10    V2-RT      0.092/0.086/0.093    11.64/8.61/11.88
KLP-MPI    0.087/0.098/0.098    10.35/14.29/14.19    V2-MPI     0.074/0.077/0.074    3.29/4.44/3.29

Note: In each cell, results represent correlations of 0.3/0.6/0.8.

Figure 1. Mean bias of the three ability dimensions under each item selection method

Figure 2. MSE of each ability dimension under each item selection method
Note: Original = item selection index without item exposure control strategies;
D = D-optimality; K = KLP; '-1', '-2', and '-3' denote the first, second, and third dimensions.

Figure 3. Item pool usage and overexposed item rates for each method under different
correlations.

Figure 4. Item exposure rates of different methods under a correlation of 0.6 for (a) the four item
selection indices without item exposure control, (b) the three item exposure control methods
combined with KLP.




